{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# [kaggle][学习向]sf-crime数据的多层神经网络（其一）\n",
    "\n",
    "在上一篇[kaggle][学习向]sf-crime数据的单层神经网络中，我们完成了一个简单的神经网络，完成了第一次的训练、预测与提交，但是这不够，我们需要更好，这个比赛的数据更适合传统的机器学习方法，笔者在进行多次整理、加深、调参之后停留在了Top30%，（但是已经是这个比赛里面使用NN的notebook里面最好的成绩了），相信大家能够超越我，加油！\n",
    "\n",
    "在上一篇中，我们的神经网络已经能够完成整个流程，那么我们应该怎样提升它呢？\n",
    "\n",
    "一、需要注意到，此比赛使用的评分标准是log_loss函数，而我们使用的是torch的CrossEntropyLoss，这不适合我们评断我们的训练效果\n",
    "\n",
    "二、kaggle比赛的测试集是没有标签的，也就是说，除了提交，我们没有一个很好的方式判断我们训练的效果\n",
    "\n",
    "三、引入GPU的使用\n",
    "\n",
    "所以我们需要整理一次项目，为后面的调整做铺垫\n",
    "\n",
    "让我们开始\n",
    "\n",
    "导入所需要的库："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import torch\n",
    "from torch import nn\n",
    "from sklearn.model_selection import KFold"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "导入处理好的训练集的特征和标签："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_features = np.load('./data/train_features.npy')\n",
    "train_labels = np.load('./data/train_labels_onehot.npy')\n",
    "test_features = np.load('./data/test_features.npy')\n",
    "\n",
    "num_inputs = 21\n",
    "num_outputs = 39"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 自定义损失loss函数\n",
    "\n",
    "log_loss是在机器学习构建分类模型的任务中经常使用的损失度量方法，公式是：\n",
    "\n",
    "让我们自定义一个log_loss："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MultiClassLogLoss(torch.nn.Module):\n",
    "    def __init__(self):\n",
    "        super(MultiClassLogLoss, self).__init__()\n",
    "\n",
    "    def forward(self, y_pred, y_true):\n",
    "        return -(y_true *\n",
    "                 torch.log(y_pred.float() + 1.00000000e-15)) / y_true.shape[0]\n",
    "# 防止出现log(0)，加1*10^-15\n",
    "\n",
    "loss = MultiClassLogLoss()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 定义模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "class build_model(torch.nn.Module):\n",
    "    def __init__(self, num_inputs, num_outputs):\n",
    "        super(build_model, self).__init__()\n",
    "        self.net = torch.nn.Sequential()\n",
    "        self.net.add_module('Linear', nn.Linear(num_inputs, num_outputs))\n",
    "        self.net.add_module('Softmax', nn.Softmax(dim=-1))\n",
    "\n",
    "    def forward(self, x):\n",
    "        return self.net(x)\n",
    "    \n",
    "net = build_model(num_inputs, num_outputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 批量读取数据函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_iter(train_features, train_labels, batch_size):\n",
    "    train_features = torch.tensor(train_features, dtype=torch.float)\n",
    "    train_labels = torch.tensor(train_labels)\n",
    "    dataset = torch.utils.data.TensorDataset(train_features, train_labels)\n",
    "    return torch.utils.data.DataLoader(dataset, batch_size, shuffle=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 训练/泛化误差计算函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_loss(net, loss, features, labels, team):\n",
    "    net.eval()\n",
    "    batch = make_iter(features, labels, 1024)\n",
    "    loss_num = 0\n",
    "    n = 0\n",
    "    for x, y in batch:\n",
    "        loss_num += loss(net(x), y).sum().item()\n",
    "        n += 1\n",
    "    print(team, end=' ')\n",
    "    print('loss:', loss_num / n)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 训练函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train(features, labels, batch_size):\n",
    "    net.train()\n",
    "    train_iter = make_iter(features, labels, batch_size)\n",
    "    for X, y in train_iter:\n",
    "        y_hat = net(X)\n",
    "        l = loss(y_hat, y).sum()\n",
    "        optimizer.zero_grad()\n",
    "        l.backward()\n",
    "        optimizer.step()\n",
    "    show_loss(net, loss, features, labels, '训练集')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 设置超参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_epochs = 5\n",
    "k_fold_num = 5\n",
    "batch_size = 32\n",
    "lr = 0.1\n",
    "k_fold = True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 定义优化函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "optimizer = torch.optim.SGD(net.parameters(), lr=lr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 引入K折交叉验证\n",
    "\n",
    "用同一数据集，既进行训练，又进行模型误差估计，对误差的估计会出现不准确的问题，这就是所谓的模型误差估计的乐观性。为了克服这个问题，我们引入了交叉验证。\n",
    "\n",
    "基本思想是将数据分为两部分，一部分数据用来模型的训练，称为训练集；另外一部分用于测试模型的误差，称为验证集。\n",
    "\n",
    "由于两部分数据不同，估计得到的泛化误差更接近真实的模型表现。能更好的评估模型的效果。\n",
    "\n",
    "## 开始训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第1轮的第1折：\n",
      "训练集 loss: 2.5503316954343034\n",
      "测试集 loss: 2.5525896840317306\n",
      "第1轮的第2折：\n",
      "训练集 loss: 2.549008633930551\n",
      "测试集 loss: 2.5492086715476456\n",
      "第1轮的第3折：\n",
      "训练集 loss: 2.5463152279311645\n",
      "测试集 loss: 2.5483997782995536\n",
      "第1轮的第4折：\n",
      "训练集 loss: 2.546247698475251\n",
      "测试集 loss: 2.548039274160252\n",
      "第1轮的第5折：\n",
      "训练集 loss: 2.548516502185744\n",
      "测试集 loss: 2.5459297632062157\n",
      "第2轮的第1折：\n",
      "训练集 loss: 2.5463628101626914\n",
      "测试集 loss: 2.548425505327624\n",
      "第2轮的第2折：\n",
      "训练集 loss: 2.546687583881634\n",
      "测试集 loss: 2.5528524781382362\n",
      "第2轮的第3折：\n",
      "训练集 loss: 2.5462778594681543\n",
      "测试集 loss: 2.543629068274831\n",
      "第2轮的第4折：\n",
      "训练集 loss: 2.54511651283798\n",
      "测试集 loss: 2.545532415079516\n",
      "第2轮的第5折：\n",
      "训练集 loss: 2.5477974574697955\n",
      "测试集 loss: 2.5463107075802114\n",
      "第3轮的第1折：\n",
      "训练集 loss: 2.5450612396957575\n",
      "测试集 loss: 2.5497251097546068\n",
      "第3轮的第2折：\n",
      "训练集 loss: 2.5462568018248755\n",
      "测试集 loss: 2.547771273657333\n",
      "第3轮的第3折：\n",
      "训练集 loss: 2.545838413363643\n",
      "测试集 loss: 2.544646130051724\n",
      "第3轮的第4折：\n",
      "训练集 loss: 2.545403643182693\n",
      "测试集 loss: 2.5437340140342712\n",
      "第3轮的第5折：\n",
      "训练集 loss: 2.5455294635483545\n",
      "测试集 loss: 2.5471859920856565\n",
      "第4轮的第1折：\n",
      "训练集 loss: 2.5473473933973394\n",
      "测试集 loss: 2.54685703682345\n",
      "第4轮的第2折：\n",
      "训练集 loss: 2.546726713375169\n",
      "测试集 loss: 2.544034934321115\n",
      "第4轮的第3折：\n",
      "训练集 loss: 2.5443586371730436\n",
      "测试集 loss: 2.548985499282216\n",
      "第4轮的第4折：\n",
      "训练集 loss: 2.545617711439772\n",
      "测试集 loss: 2.547614945921787\n",
      "第4轮的第5折：\n",
      "训练集 loss: 2.544828007242075\n",
      "测试集 loss: 2.544772464175557\n",
      "第5轮的第1折：\n",
      "训练集 loss: 2.545793165270858\n",
      "测试集 loss: 2.5451489035473314\n",
      "第5轮的第2折：\n",
      "训练集 loss: 2.5436720309382626\n",
      "测试集 loss: 2.5482053590375324\n",
      "第5轮的第3折：\n",
      "训练集 loss: 2.5440436987418127\n",
      "测试集 loss: 2.5448591612106144\n",
      "第5轮的第4折：\n",
      "训练集 loss: 2.5444287801970544\n",
      "测试集 loss: 2.5450654279354006\n",
      "第5轮的第5折：\n",
      "训练集 loss: 2.5461313164964015\n",
      "测试集 loss: 2.5441392119540724\n"
     ]
    }
   ],
   "source": [
    "if k_fold:\n",
    "    kf = KFold(n_splits=k_fold_num, shuffle=True)\n",
    "    for epoch in range(num_epochs):\n",
    "        fold_num = 0\n",
    "        for train_index, test_index in kf.split(train_features):\n",
    "            X_train, X_test = train_features[train_index], train_features[\n",
    "                test_index]\n",
    "            y_train, y_test = train_labels[train_index], train_labels[\n",
    "                test_index]\n",
    "            print('第%d轮的第%d折：' % (epoch + 1, fold_num + 1))\n",
    "            fold_num += 1\n",
    "            train(X_train, y_train, batch_size)\n",
    "            show_loss(net, loss, X_test, y_test, '测试集')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "观察训练过程中loss的变化，我们可以发现，训练误差和泛化误差差值不大，即暂没出现过拟合现象，但是训练集误差徘徊在2.545附近，还有进步空间，由于并未改变网络结构，我们暂时不得到输出"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 再次抽象\n",
    "\n",
    "并引入GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "class sf_crime():\n",
    "    def __init__(self, num_epochs, k_fold_num, batch_size, k_fold):\n",
    "        self.num_epochs = num_epochs\n",
    "        self.k_fold_num = k_fold_num\n",
    "        self.batch_size = batch_size\n",
    "        self.k_fold = k_fold\n",
    "        self.run()\n",
    "\n",
    "    def make_iter(self, train_features, train_labels):\n",
    "        train_features = torch.tensor(train_features, dtype=torch.float).to(device)\n",
    "        train_labels = torch.tensor(train_labels).to(device)\n",
    "        dataset = torch.utils.data.TensorDataset(train_features, train_labels)\n",
    "        return torch.utils.data.DataLoader(dataset, self.batch_size, shuffle=True)\n",
    "\n",
    "    def show_loss(self, features, labels, team):\n",
    "        net.eval()\n",
    "        batch = self.make_iter(features, labels)\n",
    "        loss_num = 0\n",
    "        n = 0\n",
    "        for x, y in batch:\n",
    "            loss_num += loss(net(x), y).sum().item()\n",
    "            n += 1\n",
    "        print(team, end=' ')\n",
    "        print('loss:', loss_num / n)\n",
    "\n",
    "    def train(self, features, labels):\n",
    "        net.train()\n",
    "        train_iter = self.make_iter(features, labels)\n",
    "        for X, y in train_iter:\n",
    "            y_hat = net(X)\n",
    "            l = loss(y_hat, y).sum()\n",
    "            optimizer.zero_grad()\n",
    "            l.backward()\n",
    "            optimizer.step()\n",
    "        self.show_loss(features, labels, '训练集')\n",
    "\n",
    "    def run(self):\n",
    "        if self.k_fold:\n",
    "            kf = KFold(n_splits=self.k_fold_num, shuffle=True)\n",
    "            for epoch in range(self.num_epochs):\n",
    "                fold_num = 0\n",
    "                for train_index, test_index in kf.split(train_features):\n",
    "                    X_train, X_test = train_features[train_index], train_features[\n",
    "                        test_index]\n",
    "                    y_train, y_test = train_labels[train_index], train_labels[\n",
    "                        test_index]\n",
    "                    print('第%d轮的第%d折：' % (epoch + 1, fold_num + 1))\n",
    "                    fold_num += 1\n",
    "                    self.train(X_train, y_train)\n",
    "                    self.show_loss(X_test, y_test, '测试集')\n",
    "        else:\n",
    "            for epoch in range(self.num_epochs):\n",
    "                print('第%d轮：' % (epoch + 1))\n",
    "                self.train(train_features, train_labels)\n",
    "\n",
    "    def write(self, version):\n",
    "        net.eval()\n",
    "        test_iter = torch.utils.data.DataLoader(torch.tensor(test_features,\n",
    "                                                             dtype=torch.float).to(device),\n",
    "                                                1024,\n",
    "                                                shuffle=False)\n",
    "        testResult = [line for x in test_iter for line in net(x).cpu().detach().numpy()]\n",
    "        sampleSubmission = pd.read_csv('../input/sf-crime/sampleSubmission.csv.zip')\n",
    "        Result_pd = pd.DataFrame(testResult,\n",
    "                                 index=sampleSubmission.index,\n",
    "                                 columns=sampleSubmission.columns[1:])\n",
    "        Result_pd.to_csv('../working/sampleSubmission('+str(version)+').csv', index_label='Id')\n",
    "        torch.save(net, '../working/net('+str(version)+').pkl')\n",
    "        print('Finish!')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6.10 64-bit ('Pytorch-learn': conda)",
   "language": "python",
   "name": "python361064bitpytorchlearnconda080df47efea24539a61202fa66a72562"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
